Abstract

We propose a machine architecture for a high-performance processing
node for a message-passing, MIMD concurrent computer. The principal
mechanisms for attaining this goal are the direct execution and
buffering of messages and a memory-based architecture that permits
very fast context switches. Our architecture also includes a novel
memory organization that permits both indexed and associative accesses
and that incorporates an instruction buffer and message queue.
Simulation results suggest that this architecture reduces message
reception overhead by more than an order of magnitude.
Architecture of a Message-Driven Processor¹

William J. Dally, Linda Chao, Andrew Chien, Soha Hassoun, Waldemar Horwat,
Jon Kaplan, Paul Song, Brian Totty, and Scott Wills

Artificial Intelligence Laboratory and Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139



1  Introduction 
1.1  Summary 

The message-driven processor (MDP) is a processing node for a
message-passing concurrent computer. It is designed to support
fine-grain concurrent programs by reducing the overhead and latency
associated with receiving a message, by reducing the time necessary
to perform a context switch, and by providing hardware support for
object-oriented concurrent programming systems.

Message handling overhead is reduced by directly executing messages
rather than interpreting them with sequences of instructions. As
shown in Figure 1, the MDP contains two control units, the
instruction unit (IU) that executes instructions and the message
unit (MU) that executes messages. When a message arrives it is
examined by the MU, which decides whether to queue the message or to
execute the message by preempting the IU. Messages are enqueued
without interrupting the IU. Message execution is accomplished by
immediately vectoring the IU to the appropriate memory address.
Special registers are dedicated to the MU so no time is wasted saving
or restoring state when switching between message and instruction
execution.


Context switch time is reduced by making the MDP a memory rather
than register based processor. Each MDP instruction may read or
write one word of memory. Because the MDP memory is on-chip, these
memory references do not slow down instruction execution. Four
general purpose registers are provided to allow instructions that
require up to three operands to execute in a single cycle. The
entire state of a context may be saved or restored in less than 10
clock cycles. Two register sets are provided, one for each of two
priority levels, to allow low priority messages to be preempted
without saving state.

Figure 1: Message Driven Processor Organization

The MDP memory can be accessed either by address or by content, as a
set-associative cache. Cache access is used to provide address
translation from object identifier to object location. This
translation mechanism is used to support a global address space.
Object identifiers in the MDP are global. They are translated at
run time to find the node on which the object resides and the
address within this node at which the object starts.

The associative access of the MDP memory is also used to look up the
method to be executed in response to a message. The cache acts as an
ITLB [3] and translates a selector (from the message) and class
(from the receiver) into the starting address of the method. Because
the MDP maintains a global name space, it is not necessary to keep a
copy of the program code (and the operating system code) at each
node. Each MDP keeps a method cache in its memory and fetches
methods from a single distributed copy of the program on cache
misses.


¹The research described in this paper was sponsored by the De-

The MDP is a tagged machine. Tags are used both to support
dynamically-typed programming languages and to support concurrent
programming constructs such as futures [8].

The MDP is intended to support a fine-grain, object-oriented
concurrent programming system in which a collection of objects
interact by passing messages [1]. In such a system, addresses are
object names (identifiers). Execution is invoked by sending a
message specifying a method to be performed, and possibly some
arguments, to an object. When an object receives a message it looks
up and executes the corresponding method. Method execution may
involve modifying the object's state, sending messages, and creating
new objects. Because the messages are short (typically 6 words), and
the methods are short (typically 20 instructions), it is critical
that the overhead involved in receiving a message and in switching
tasks to execute the method be kept to a minimum.

1.2 Background

Several message-passing concurrent computers have been built using
conventional microprocessors for processing elements. Examples of
this class of machines include the Cosmic Cube [13], the Intel iPSC
[7], and the S-NET [2]. The software overhead of message
interpretation on these machines is about 300μs. The message is
copied into memory by a DMA controller or communication processor.
The node's microprocessor then takes an interrupt, saves its current
state, fetches the message from memory, and interprets the message
by executing a sequence of instructions. Finally, the message is
either buffered or the method specified by the message is executed.

This large overhead restricts programmers to using coarse-grained
concurrency. The code executed in response to each message must run
for at least a millisecond to achieve reasonable (75%) efficiency.
Much of the potential concurrency in an application cannot be
exploited at this coarse grain size. For many applications the
natural grain size is about 20 instruction times [4] (5μs on a
high-performance microprocessor). Two hundred times as many
processing elements could be applied to a problem if we could
efficiently run programs with a granularity of 5μs rather than 1
ms.
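
The efficiency figures above can be checked with a short calculation. The sketch below assumes a simple model that is not stated in the paper: efficiency is useful run time divided by useful run time plus a fixed per-message overhead. The function name and constants are illustrative only.

```python
def efficiency(grain_us: float, overhead_us: float) -> float:
    """Fraction of node time spent on useful work per message.

    Assumed model: each message costs a fixed overhead, then runs
    for one grain of useful work.
    """
    return grain_us / (grain_us + overhead_us)


OVERHEAD_US = 300.0  # software overhead quoted for conventional nodes

coarse = efficiency(1000.0, OVERHEAD_US)  # 1 ms of work per message
fine = efficiency(5.0, OVERHEAD_US)       # 5 us of work per message
ratio = 1000.0 / 5.0                      # how much finer the grain is

print(f"1 ms grain: {coarse:.0%} efficient")
print(f"5 us grain: {fine:.1%} efficient")
print(f"grain-size ratio: {ratio:.0f}x")
```

Under this model a 1 ms grain lands close to the 75% figure quoted above, while a 5μs grain spends nearly all of its time on overhead, which is the motivation for reducing reception cost in hardware.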

For  many  of  the  early  message-passing  machines,  the  network 
latency  was  several  milliseconds,  making  the  software  overhead  a 
minor  concern.  However,  recent  developments  in  communication 
networks  for  these  machines  [5]  [6]  have  reduced  network  latency 
to  a  few  microseconds  making  software  overhead  a  major  concern. 


The MDP is not the first processing element designed explicitly for
a message-passing concurrent computer. The N-CUBE family of parallel
processors is built around a single chip processing element that is
used in conjunction with external memory [11]. The Mosaic processor
integrates the processor, memory, and communication unit all on one
chip [10]. Neither of these processors addresses the issue of
message reception overhead. The N-CUBE processor uses DMA and
interrupts to handle its messages, while the Mosaic receives
messages one word at a time using programmed transfers out of
receive registers. Closer in spirit to the MDP is the InMOS
Transputer [9]. The Transputer supports a static, synchronous model
of programming based on CSP [12] in much the same way that the MDP
supports a dynamic, asynchronous model based on actors [1].


Some of the ideas used in the MDP have been borrowed from other
processors. Multiple register sets have been used in microprocessors
such as the Zilog Z80 [16], and in microcoded processors such as the
XEROX Alto [15]. The Alto uses its multiple register sets to perform
micro-tasking. By switching between the register sets, context
switches can be made on microinstruction boundaries with no state
saving required. Spector [14] used micro-tasking on the Alto to
implement remote operations over an Ethernet, an idea similar to
direct method execution.

1.3  Outline 

The remainder of this paper describes the MDP in detail. The user
architecture of the MDP is presented in Section 2. The machine
state, message set, and instruction set are discussed. The MDP micro
architecture is the topic of Section 3. This section includes a
description of our novel memory architecture. Section 4 discusses
support for concurrent execution models. We show how a programming
system that combines reactive objects, dynamic typing, fetch-and-op
combining, and futures can be efficiently implemented on the MDP.
Performance estimates for the MDP are discussed in Section 5.


2  User  Architecture 

2.1  Machine  State 

The programmer sees the MDP as a 4K-word by 36-bit/word array of
read-write memory (RWM), a small read-only memory (ROM), and a
collection of registers.

The MDP registers are shown in Figure 2. The registers are divided
into instruction registers and message registers. There are two sets
of instruction registers, one for each of two priority levels. Each
set consists of four general registers R0-R3, four address registers
A0-A3, and an instruction pointer IP. The general registers are 36
bits long (32 data bits + 4 tag bits) and are used to hold operands
and results of arithmetic operations.

The 28-bit address registers are divided into 14-bit base and limit
fields that point to the base and limit addresses of an object in
the node's local memory. Associated with each address register is an
invalid bit and a queue bit. The invalid bit is set when the
register does not contain a valid address. The queue bit is set when
the register is used to reference the current message queue. Address
registers are not saved on a context switch since the object they
point to may be relocated. Instead, the object's identifier (OID) is
re-translated into the object's base and limit addresses when the
context is restored. All address registers, as well as the queue and
translation buffer registers, appear to the programmer to have two
adjacent 14-bit fields.

The instruction pointer is a 16-bit register that is used to fetch
instructions. The low order 14 bits select a word of memory, bit 14
selects one of the two instructions packed in the word, and bit 15
determines whether the IP is an absolute address or an offset into
A0. Because instructions are prefetched, the value of the IP may be
ahead of the next instruction.

Figure 2: MDP Registers
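
The instruction pointer layout described in this section can be illustrated with a small decoder. This is only a sketch: the text gives the bit positions but not which value of bit 14 selects which instruction, nor which value of bit 15 means "absolute", so the fields are returned as raw bits. The names `DecodedIP`, `decode_ip`, and `encode_ip` are hypothetical.

```python
from typing import NamedTuple

WORD_MASK = 0x3FFF  # low-order 14 bits select a word of memory


class DecodedIP(NamedTuple):
    word: int  # 14-bit word address, or offset into A0
    slot: int  # bit 14: which of the two packed instructions
    mode: int  # bit 15: absolute address versus offset into A0


def decode_ip(ip: int) -> DecodedIP:
    """Split a 16-bit IP value into its three fields."""
    if not 0 <= ip <= 0xFFFF:
        raise ValueError("the IP is a 16-bit register")
    return DecodedIP(ip & WORD_MASK, (ip >> 14) & 1, (ip >> 15) & 1)


def encode_ip(word: int, slot: int, mode: int) -> int:
    """Rebuild the 16-bit IP value from its fields."""
    if not (0 <= word <= WORD_MASK and slot in (0, 1) and mode in (0, 1)):
        raise ValueError("field out of range")
    return (mode << 15) | (slot << 14) | word


print(decode_ip(0xC005))
```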

The small register set allows a context switch to be performed very
quickly. Only five registers must be saved and nine registers
restored. Because the on-chip memory can be accessed in a single
clock cycle, the fact that few intermediate results can be kept in
registers does not significantly degrade performance.

The message registers consist of two sets of queue registers, a
translation buffer base/mask register, and a status register. A set
of queue registers is provided for each of the two receive queues.
Each queue register set contains a 28-bit base/limit register and a
28-bit head/tail register. The queue base/limit register contains
14-bit pointers to the first and last words allocated to the queue,
while the head/tail register contains 14-bit pointers to the first
and last words that hold valid data. As with the address registers,
all these 14-bit fields contain physical addresses into local
memory. Special address hardware is provided to enqueue or dequeue a
word in a single clock cycle.
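
The queue registers describe a circular buffer held in ordinary node memory. The following is a minimal software sketch of that idea, not the MDP hardware: for simplicity it keeps the tail as the next free slot and tracks an explicit word count, whereas the text describes head and tail as pointing to the first and last valid words. The class name `ReceiveQueue` is hypothetical.

```python
class ReceiveQueue:
    """Circular message queue occupying memory[base..limit] inclusive."""

    def __init__(self, memory, base, limit):
        self.mem = memory
        self.base, self.limit = base, limit  # first/last allocated words
        self.head = self.tail = base         # next word to read / write
        self.count = 0                       # words currently queued

    def _advance(self, ptr):
        # Wrap from the last allocated word back to the first.
        return self.base if ptr == self.limit else ptr + 1

    def enqueue(self, word):
        if self.count == self.limit - self.base + 1:
            # The MDP traps on message queue overflow.
            raise OverflowError("message queue overflow")
        self.mem[self.tail] = word
        self.tail = self._advance(self.tail)
        self.count += 1

    def dequeue(self):
        if self.count == 0:
            raise IndexError("queue empty")
        word = self.mem[self.head]
        self.head = self._advance(self.head)
        self.count -= 1
        return word


memory = [0] * 16
q = ReceiveQueue(memory, base=4, limit=7)
for w in (1, 2, 3, 4):
    q.enqueue(w)
first = q.dequeue()
q.enqueue(5)  # wraps around and reuses word 4
```

In the MDP the pointer update and wraparound are done by dedicated address hardware in one clock cycle; the sketch only shows the bookkeeping involved.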

We have omitted a send queue from the MDP for two reasons. First,
analysis of the networks we plan to use [6] indicates that the
network will be able to accept messages as fast as the nodes can
generate them. Second, if network congestion does occur, the absence
of a send queue allows the congestion to act as a governor on
objects producing messages. With a send queue, these objects would
fill their respective queues before they blocked. Because both the
MDP and the network support multiple priority levels, higher
priority objects will be able to execute and clear the congestion.

The translation buffer base/mask register is used to generate
addresses when using the MDP memory as a set-associative cache. This
register contains a 14-bit base and a 14-bit mask. As shown in
Figure 3, each bit of the mask, MASK_i, selects between a bit of the
association key, KEY_i, and a bit of the base, BASE_i, to generate
the corresponding address bit, ADDR_i. The high order ten bits of
the resulting address are used to select the memory row in which the
key might be found. The operation of the memory as a set-associative
cache is described in Section 3.2.

Figure 3: Translation Buffer Address Formation
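
The per-bit selection between key and base amounts to a bitwise multiplex. A minimal sketch follows, assuming (the text does not say which way round) that a 1 in MASK_i selects KEY_i and a 0 selects BASE_i. The function names are illustrative.

```python
ADDR_BITS = 14
ADDR_MASK = (1 << ADDR_BITS) - 1


def tb_address(key: int, base: int, mask: int) -> int:
    """Form a 14-bit translation-table address from key, base, and mask.

    Assumed polarity: mask bit 1 takes the key bit, 0 takes the base bit.
    """
    return ((key & mask) | (base & ~mask)) & ADDR_MASK


def tb_row(addr: int) -> int:
    """The high-order ten bits of the address select the memory row."""
    return addr >> (ADDR_BITS - 10)


addr = tb_address(key=0x1234, base=0x3000, mask=0x00FF)
print(hex(addr), hex(tb_row(addr)))
```

With this mask, the base fixes the upper six address bits (where the table lives) and the low eight bits of the key choose a location within it, so the mask width sets the size of the translation table.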

The status register contains a set of bits that reflect the current
execution state of the MDP, including current priority level, a
fault status bit, and an interrupt enable bit.

2.2  Message  Set 

The MDP controller is driven by the incoming message stream. The
arrival of a message causes some action to be performed by the MDP.
This action may be to read or write a memory location, execute a
sequence of instructions, and/or send additional messages. The MDP
controller reacts to the arrival of a message by scheduling the
execution of a code sequence.

Rather than providing a large message set hard-wired into the MDP,
we chose to implement only a single primitive message, EXECUTE. This
message takes as arguments a priority level <priority> (0 or 1), an
opcode <opcode>, and an optional list of arguments, <arg>. The
message opcode is a physical address to the routine that implements
the message. More complex messages, such as those that invoke a
method or dereference an identifier, can be implemented almost as
efficiently using the EXECUTE message as they could if they were
hard-wired.

EXECUTE <priority> <opcode> <arg> ... <arg>

When a message arrives at a message-driven processor, it is buffered
until the node is either idle or executing code at a lower priority
level. If the node is already executing at a lower priority, no
buffering is required. This buffering takes place without
interrupting the processor, by stealing memory cycles. The processor
then examines the header of the message and dispatches control to an
instruction sequence beginning at the physical memory address given
in the <opcode> field of the message. Saving state is not required
as the new message is executed in the high priority registers.
Message arguments are read under program control. The processor's
control unit, rather than software, decides (1) whether to buffer or
execute the message and (2) what address to branch to when the
message is accepted.
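
The buffer-or-execute decision made by the control unit can be sketched in software. This is an illustrative model only: it assumes priority 1 outranks priority 0, and the names `MessageUnit`, `should_execute`, and `IDLE` are hypothetical rather than taken from the MDP design.

```python
from collections import deque
from typing import Optional

IDLE = None  # hypothetical marker: the node is not executing anything


def should_execute(current: Optional[int], incoming: int) -> bool:
    """True if an arriving message can run at once instead of queueing."""
    return current is IDLE or incoming > current


class MessageUnit:
    def __init__(self):
        self.queues = {0: deque(), 1: deque()}  # one receive queue per level
        self.current = IDLE                     # priority now executing

    def arrive(self, priority: int, opcode: int, *args: int) -> str:
        if should_execute(self.current, priority):
            self.current = priority
            # Vector straight to the routine named by <opcode>.
            return f"dispatch to {opcode:#06x}"
        self.queues[priority].append((opcode, args))
        return "buffered"


mu = MessageUnit()
events = [
    mu.arrive(0, 0x100),      # node idle: runs immediately
    mu.arrive(0, 0x200, 7),   # same priority already running: queued
    mu.arrive(1, 0x300),      # higher priority: preempts
]
print(events)
```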

In the MDP, all messages do result in the execution of instructions.
The key difference is that no instructions are required to receive
or buffer the message, and very few instructions are required to
locate the code to be executed in response to the message. The MDP
provides efficient mechanisms to buffer messages in memory, to
synchronize program execution with message arrival, and to transfer
control rapidly in response to a message. By performing these
functions in hardware (not microcode), their overhead is reduced to
a few clock cycles (< 500ns).

We choose not to implement complex messages in microcode because
they will run just as fast using macrocode, and implementing them in
macrocode gives us more flexibility. Since the MDP is an
experimental machine, we place a high value on providing the
flexibility to experiment with different concurrent programming
models and different message sets, and to instrument the system. The
MDP uses a small ROM to hold the code required to execute the
message types listed below. The ROM code uses the macro instruction
set and lies in the same address space as the RWM, so it is very
easy for the user to redefine these messages simply by specifying a
different start address in the header of the message.

READ <base> <limit> <reply-node> <reply-sel>
WRITE <base> <limit> <data> ... <data>
READ-FIELD <obj-id> <index> <reply-id> <reply-sel>
WRITE-FIELD <obj-id> <index> <data>
DEREFERENCE <oid> <reply-id> <reply-sel>
NEW <size> <data> ... <data> <reply-id> <reply-sel>
CALL <method-id> <arg> ... <arg>
SEND <receiver-id> <selector> <arg> ... <arg>
REPLY <context-id> <index> <data>
FORWARD <control> <data> ... <data>
COMBINE <obj-id> <arg> ... <arg> <reply-id> <reply-sel>
GC <obj-id> <mark>

The READ, WRITE, READ-FIELD, WRITE-FIELD, DEREFERENCE, and NEW
messages are used to read or write memory locations. READ and WRITE
read and write blocks of physical memory. They deal only with
physical memory addresses, <base> <limit>, and physical node
addresses, <reply-node>. The READ-FIELD and WRITE-FIELD messages
read and write a field of a named object. These messages use logical
addresses (object identifiers), <obj-id>, <reply-id>, and will work
even if their target is relocated to another memory address, or
another node. The DEREFERENCE method reads the entire contents of an
object. NEW creates a new object with the specified contents
(optional) and returns an identifier. The <reply-sel>
(reply-selector) field of the read messages specifies the selector
to be used in the reply message.

The CALL and SEND messages cause a method to be executed. The method
is specified directly in the CALL message, <method-id>. In the SEND
message, the method is determined at run-time depending on the class
of the receiver.

The REPLY, FORWARD, COMBINE, and GC messages are used to implement
futures, message multicast, fetch-and-op combining, and garbage
collection respectively.

2.3 Instruction Set

Each MDP instruction is 17 bits in length. Two instructions are
packed into each MDP word (the INST tag is abbreviated). Each
instruction may specify at most one memory access. Registers or
constants supply all other operands.

As shown in Figure 4, each instruction contains a 6-bit opcode
field, two 2-bit register select fields, and a 7-bit operand
descriptor field. The operand descriptor can be used to specify: (1)
a memory location using an offset (short integer or register) from
an address register, (2) a short integer or bit-field constant, (3)
access to the message port, or (4) access to any of the processor
registers.

Figure 4: Instruction Format

In addition to the usual data movement, arithmetic, logical, and
control instructions, the MDP provides instructions to:

• Read, write, and check tag fields

• Look up the data associated with a key using the TBM register and
set-associative features of the memory

• Enter a key/data pair in the association table

• Transmit a message word

• Suspend execution of a method

All instructions are type checked. Attempting an operation on the
wrong class of data results in a trap. Traps are also provided for
arithmetic overflow, for translation buffer miss, for illegal
instruction, for message queue overflow, etc.
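
The 17-bit instruction format and the two-per-word packing can be illustrated as follows. The field widths come from the text; the order of the fields within the instruction and the position of each instruction within the word are assumptions made for this sketch, and all function names are hypothetical.

```python
OPCODE_BITS, REG_BITS, OPERAND_BITS = 6, 2, 7
INST_BITS = OPCODE_BITS + 2 * REG_BITS + OPERAND_BITS  # 17 bits in total
INST_MASK = (1 << INST_BITS) - 1


def encode_inst(opcode: int, reg_a: int, reg_b: int, operand: int) -> int:
    """Pack the four fields, opcode in the high bits (assumed order)."""
    fields = ((opcode, OPCODE_BITS), (reg_a, REG_BITS),
              (reg_b, REG_BITS), (operand, OPERAND_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field out of range")
    return (opcode << 11) | (reg_a << 9) | (reg_b << 7) | operand


def decode_inst(inst: int):
    """Recover (opcode, reg_a, reg_b, operand) from a 17-bit value."""
    return (inst >> 11) & 0x3F, (inst >> 9) & 0x3, (inst >> 7) & 0x3, inst & 0x7F


def pack_word(first: int, second: int) -> int:
    """Two instructions fill 34 bits, leaving 2 of the 36 for the tag."""
    return ((first & INST_MASK) << INST_BITS) | (second & INST_MASK)


def unpack_word(word: int):
    return (word >> INST_BITS) & INST_MASK, word & INST_MASK


inst = encode_inst(0x2A, 1, 2, 0x55)
word = pack_word(inst, encode_inst(0x01, 0, 3, 0x7F))
print(decode_inst(inst), unpack_word(word))
```

The arithmetic is consistent with the remark that the INST tag is abbreviated: two 17-bit instructions leave only two bits of a 36-bit word, fewer than the four tag bits carried by data words.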

3  Micro  Architecture 

Figure 5 shows a block diagram of the MDP. Messages arrive at the
network interface. The message unit (MU) controls the reception of
these messages and, depending on the status of the instruction unit
(IU), either signals the IU to begin execution or buffers the
message in memory. The IU executes methods by controlling the
registers and arithmetic units in the data path, and by performing
read, write, and translate operations on the memory. While the MU
and IU are conceptually separate units, in the current
implementation they are combined into a single controller.

Figure 5: MDP Block Diagram

3.1 Data Path

As shown in Figure 6, the data path is divided into two sections.
The arithmetic section (left) consists of two copies of the general
registers and an arithmetic unit (ALU). The ALU accepts one argument
from the register file, one argument from the data bus, and returns
its result to the register file.

The address section (right) consists of the address, queue, IP, and
TBM registers and an address arithmetic unit (AAU). Each register in
the address section holds two 14-bit fields that are bit-interleaved
so that corresponding bits of the two fields can be easily compared.
The AAU generates memory addresses, and may modify the contents of a
queue register. In a single cycle it can (1) perform a queue insert
or delete (with wraparound), (2) insert portions of a key into a
base field to perform a translate operation, (3) compute an address
as an offset from an address register's base field and check the
address against the limit field, or (4) fetch an instruction word
and increment the corresponding IP.

Figure 6: MDP Data Path

3.2 Memory Design

A block diagram of the MDP memory is shown in Figure 7. The memory
system consists of a memory array, a row decoder, a column
multiplexor and comparators, and two row buffers (one for
instruction fetch and one for queue access). Word sizes in this
figure are for our prototype, which will have only 1K words of RWM.

Figure 7: MDP Memory Block Diagram

In the prototype, the memory array will be a 256-row by 144-column
array of 3-transistor DRAM cells. In an industrial version of the
chip, a 4K word memory using 1-transistor cells would be feasible.
We wanted to provide simultaneous memory access for data operations,
instruction fetches, and queue inserts; however, to achieve high
memory density we could not alter the basic memory cell. Making a
dual port memory would double the area of the basic cell. Instead,
we have provided two row buffers that cache one memory row (4 words)
each. One buffer is used to hold the row from which instructions are
being fetched. The other holds the row in which message words are
being enqueued. Address comparators are provided for each row buffer
to prevent normal accesses to these rows from receiving stale data.
We are considering using additional address comparators to provide
spare memory rows that can be configured at power-up to replace
defective rows.

The MDP memory is used both for normal read/write operations and as
a set-associative cache to translate object identifiers into
physical addresses and to perform method lookup. These translation
operations are performed as shown in Figure 8. The TBM register
selects the range of memory rows that contain the translation table.
The key being translated selects a particular row within this range.
Comparators built into the column multiplexor compare the key with
each odd word in the row. If a comparator indicates a match, it
enables the adjacent even word onto the data bus. If no comparator
matches the data, a miss is signaled and the processor takes a trap.
For clarity, Figure 8 shows the words brought out separately. In
fact, to simplify multiplexor layout, the words in a row are
bit-interleaved.

Figure 8: Associative Memory Access
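
The associative access described in this section, where keys in the odd words of a row are compared and a match returns the adjacent even word, can be modelled in a few lines. This is a software sketch under two assumptions not stated in the text: the data word is the even word immediately before its key, and memory is a flat list with four words per row as in the prototype. The names `lookup`, `enter`, and `TranslationMiss` are hypothetical.

```python
WORDS_PER_ROW = 4  # prototype row size given in the text


class TranslationMiss(Exception):
    """Stands in for the trap the processor takes on a miss."""


def lookup(memory, row, key):
    """Compare key with each odd word of the row; return the paired data."""
    start = row * WORDS_PER_ROW
    words = memory[start:start + WORDS_PER_ROW]
    for i in range(1, WORDS_PER_ROW, 2):
        if words[i] == key:
            return words[i - 1]  # assumed pairing: even word before the key
    raise TranslationMiss(key)


def enter(memory, row, way, key, data):
    """Store a key/data pair in one of the row's two ways."""
    start = row * WORDS_PER_ROW + 2 * way
    memory[start], memory[start + 1] = data, key


memory = [0] * 16
enter(memory, row=2, way=1, key=0xBEEF, data=0x123)
print(hex(lookup(memory, 2, 0xBEEF)))
```

The hardware performs all comparisons in a row in parallel; the loop here only mirrors the logic, not the timing.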
3.3 Area Estimate

Our data paths use a pitch of 60λ (λ is half the minimum design
rule) per bit, giving a height of 2160λ. We expect the data path to
be ≈ 3000λ wide for an area of ≈ 6.5Mλ². A 1K word memory array
built from 3T DRAM cells will have dimensions of ≈ 2450λ × 6150λ ≈
15Mλ². We expect the memory peripheral circuitry to add an
additional 5Mλ². We plan to use an on-chip communication unit
similar to the Torus Routing Chip [5], which will take an additional
4Mλ². Allowing 8Mλ² for wiring gives a total chip area of ≈ 40Mλ²
(or a chip about 6.5mm on a side in 2μm CMOS) for our 1K word
prototype.
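
The area budget can be re-added as a quick consistency check. The sketch below assumes a 36-bit data path at the stated pitch and takes λ to be 1μm, on the reading that "2μm CMOS" names the minimum feature size; both are assumptions made for the calculation, and the variable names are illustrative.

```python
import math

BITS = 36                       # word width, so 36 bit slices
datapath = BITS * 60 * 3000     # height (36 x 60 lambda) times width
memory = 2450 * 6150            # 1K-word 3T DRAM array
periphery = 5e6                 # memory peripheral circuitry
comm = 4e6                      # on-chip communication unit
wiring = 8e6                    # wiring allowance

total = datapath + memory + periphery + comm + wiring

LAMBDA_UM = 1.0                 # assumed: lambda is half of 2 um
side_mm = math.sqrt(total) * LAMBDA_UM / 1000.0

print(f"data path: {datapath / 1e6:.2f} M lambda^2")
print(f"memory:    {memory / 1e6:.2f} M lambda^2")
print(f"total:     {total / 1e6:.1f} M lambda^2")
print(f"side:      {side_mm:.2f} mm")
```

The itemized components sum to slightly under the quoted total, which is consistent with the text rounding the estimate up to about 40Mλ².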


4  Execution  Model 
4.1  CALL  and  SEND 

In a concurrent, object-oriented programming system, programs
operate by sending messages to objects. Each message results in the
execution of a method. The MDP supports this model of programming
with the CALL and SEND messages.

The execution sequence for a CALL message is shown in Figure 9. The
first word of the message contains the priority level (0), and the
physical address of the CALL subroutine. If the processor is idle,
in the clock cycle following receipt of this word, the first
instruction of the call routine is fetched. The call routine then
reads the object identifier for the method. This identifier is
translated into a physical address in a single clock cycle using the
translation table in memory. If the translation misses, or if the
method is not resident in memory, a trap routine performs the
translation or fetches the method from a global data structure.

Figure 9: Processing a CALL Message

Once the method code is found, the CALL routine jumps to this code.
The method code may then read in arguments from the queue. This is
accomplished by setting the queue bit of A3 on message arrival.
Subsequent accesses through A3 read words from the message queue. If
the method faults, the message is copied from the queue to the heap.
Register A3 is set to point to the message in the heap when the code
is resumed. The argument object identifiers are translated to
physical memory base/limit pairs using the translate instruction. If
the method needs space to store local state, it may create a context
object. When the method has finished execution, or when it needs to
wait for a reply, it executes a SUSPEND instruction, passing control
to the next message.

A SEND message looks up its method based on a selector in the
message, and the class of the receiver. This method lookup is shown
in Figure 10. The receiver identifier is translated into a
base/limit pair. Using this address, the class of the receiver is
fetched. The class is concatenated with the selector field of the
message to form a key that is used to look up the physical address
of the method in the translation table. Once the method is found,
processing proceeds as with the CALL message.


Figure  10:  Method  Lookup 
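
The two-step lookup performed for a SEND message can be sketched with ordinary dictionaries standing in for the translation table and the heap. Several details here are assumptions made for illustration: the receiver's class is taken to be stored in the first word of the object, the concatenation of class and selector is modelled as a tuple key, and the names `send_lookup` and `TranslationMiss` are hypothetical.

```python
class TranslationMiss(Exception):
    """Stands in for the trap taken when a translation misses."""


def send_lookup(table, heap, receiver_id, selector):
    """Return the method address for (class of receiver, selector)."""
    try:
        # Step 1: translate the receiver identifier to a base/limit pair.
        base, _limit = table[("oid", receiver_id)]
        # Step 2: fetch the receiver's class (assumed to be its first word).
        receiver_class = heap[base]
        # Step 3: class and selector together form the method key.
        return table[("method", receiver_class, selector)]
    except KeyError as miss:
        raise TranslationMiss(miss.args[0]) from None


table = {
    ("oid", 7): (100, 103),       # object 7 occupies words 100..103
    ("method", 3, 12): 0x0400,    # class 3, selector 12 -> method address
}
heap = {100: 3}                   # first word of object 7 holds class 3

print(hex(send_lookup(table, heap, receiver_id=7, selector=12)))
```

On the MDP each table access is a single-cycle associative memory operation, and a miss is handled by a trap routine rather than an exception.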


4.3  Non-Local  References  and  Futures 

If either operand of an instruction is not of the proper type, a trap
will occur. This hardware support for run-time type checking not
only allows us to support dynamically-typed languages such as
LISP and Smalltalk, but also allows us to handle local and non-local
data uniformly. For example, suppose we attempt to access
an instance variable of an object using the instruction temp <-
anObject at: aField. If anObject is resident on the local node,
a simple memory reference is generated; however, if anObject is
resident on a different node, a message send results. This uniform
handling of objects regardless of their location relieves the
programmer and the compiler from keeping track of object locations.
More importantly, it facilitates dynamically moving objects from
node to node.

Futures are supported through the use of tags. Consider the
instruction mentioned in the previous paragraph: temp <- anObject
at: aField. If anObject is not local, a message will be sent with
the Reply-To: slot of the message specifying the variable temp
in the current context, and temp will be tagged as a context
future. When the reply message arrives, as shown in Figure 11,
it looks up the context object and overwrites the specified slot
with the proper value. In the meantime, execution continues until
the program attempts to use the value in temp, perhaps by
executing aVar <- temp * 1. If, when this instruction examines temp,
it is still tagged Future, the current context is suspended until
the value of temp is available. If the at: message had already
replied with the value of temp, however, the tag of temp would
have signified a value and the context would not be suspended.
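
The tag-based future mechanism can be illustrated with a small
Python model. It is a sketch only: the tag names, the Word and
Context classes, and the use of an exception to stand in for the
hardware trap and context suspension are assumptions introduced
here for illustration.

```python
from dataclasses import dataclass

VALUE, CFUT = "value", "context-future"  # hypothetical tag names


@dataclass
class Word:
    tag: str
    data: object = None


class Suspended(Exception):
    """Stands in for the trap that suspends the current context."""


class Context:
    def __init__(self):
        self.slots = {}

    def remote_read(self, slot):
        # A non-local access sends a message; the slot is tagged as a future.
        self.slots[slot] = Word(CFUT)

    def reply(self, slot, value):
        # The reply overwrites the slot named in Reply-To with the value.
        self.slots[slot] = Word(VALUE, value)

    def use(self, slot):
        word = self.slots[slot]
        if word.tag == CFUT:
            raise Suspended(slot)  # value not yet available
        return word.data


ctx = Context()
ctx.remote_read("temp")      # temp <- anObject at: aField (non-local)
try:
    ctx.use("temp")          # touching temp before the reply suspends
    suspended = False
except Suspended:
    suspended = True
ctx.reply("temp", 41)        # reply message arrives
a_var = ctx.use("temp") * 1  # aVar <- temp * 1 now proceeds
```

Note that the tag check happens only when the value is used, so
execution between the send and the first use is not delayed.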

Futures can be handled in a more general sense by creating an
object of class future to which the pending computation is to
reply. References to this future object may then be passed outside
of the local context. When the result of the pending computation
is available, the future object becomes this value.


Figure 11: Processing A Reply Message


4.5 Multicast and Combining

In concurrent computations it is often necessary to fan data out
to many destinations, and to accumulate data from many sources
with an associative operator. In the MDP, these functions are
performed by the FORWARD and COMBINE messages, respectively.

The FORWARD message contains the identifier of a control object,
and a message to be forwarded as specified in that object. The
control object is a list of destinations to which the message should
be forwarded along with the header (if any) which should precede
the message. When the message arrives, the control object is
located and a buffer is created in memory to hold the message. The
message is read into the buffer and at the same time transmitted
to the first destination in the list. The message is then
transmitted to the subsequent destinations on the list, and the
buffer is deallocated.
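
A minimal Python sketch of this fan-out follows. The function name,
the representation of the control object as a list of
(destination, header) pairs, and the transmit callable are
hypothetical. The sketch also does not model the overlap of
buffering with transmission to the first destination.

```python
def forward(message, control_object, transmit):
    """Relay `message` to every destination listed in the control object.

    control_object: list of (destination, header) pairs; header may be None.
    transmit: hypothetical callable standing in for the network interface.
    """
    buffer = list(message)  # buffer created in memory to hold the message
    for destination, header in control_object:
        # The header, if any, precedes the message body.
        words = ([header] if header is not None else []) + buffer
        transmit(destination, words)
    count = len(control_object)
    del buffer  # buffer is deallocated once all copies are sent
    return count


sent = []
copies = forward(
    [1, 2, 3],
    [("nodeA", None), ("nodeB", "hdr")],
    lambda dest, words: sent.append((dest, words)),
)
```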

The COMBINE message specifies the identifier of a combine object,
and a message to be combined or forwarded. The combine object
contains the destination to which combined messages are to be
forwarded, buffers for combined messages awaiting a reply, and
identifiers for the methods to be executed in response to COMBINE
or REPLY messages. The combining performed is controlled entirely
by these user-specified methods. The COMBINE message is quite
similar to a CALL, differing only in that the method to be executed
is implicit.
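
One possible combining policy can be sketched in Python as follows.
Since combining is left entirely to user-specified methods, the
policy shown here (wait for a fixed number of inputs, reduce them
with an associative operator, then forward the result) is only an
assumed example. The class name, the `expected` count, and the
`send` callable are all hypothetical.

```python
class CombineObject:
    def __init__(self, destination, expected, combine_method, send):
        self.destination = destination        # where the combined message goes
        self.expected = expected              # assumed: number of inputs awaited
        self.combine_method = combine_method  # user-specified associative operator
        self.send = send                      # stand-in for the network interface
        self.pending = []                     # buffer for messages being combined

    def on_combine(self, value):
        """Method run implicitly when a COMBINE message names this object."""
        self.pending.append(value)
        if len(self.pending) == self.expected:
            result = self.pending[0]
            for v in self.pending[1:]:
                result = self.combine_method(result, v)
            self.pending.clear()
            self.send(self.destination, result)


out = []
adder = CombineObject("root", 3, lambda a, b: a + b,
                      lambda dest, value: out.append((dest, value)))
adder.on_combine(1)
adder.on_combine(2)
before_last = list(out)  # nothing forwarded yet
adder.on_combine(4)
```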


5 Performance


We have constructed both instruction-level and register-transfer
(RT) level simulators for the MDP. Using these simulators we have
evaluated the time required by the MDP to perform a number of
simple operations. These operations are tabulated in Table 1.

In this table, W specifies the number of words transferred, and
N specifies the number of destinations for the FORWARD message.
The times for CALL, SEND, and COMBINE are the time from message
reception until the first word of the appropriate method is fetched.
Times are expressed in clock cycles. We expect the clock period
of our prototype to be 100ns.


Operation      Time
-----------    ---------
READ           5 + W
WRITE          4 + W
READ-FIELD     7
WRITE-FIELD    6
DEREFERENCE    6 + W
CALL           5
SEND           8
REPLY          7
FORWARD        5 + N x W
COMBINE        5

Table  1:  MDP  Message  Execution  Times  (in  clock  cycles) 
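
To give a feel for these numbers, the short Python sketch below
converts a few of the cycle counts into wall-clock time at the
expected 100ns clock period. The function names and the chosen
values of W and N are illustrative assumptions. Only the CALL,
SEND, REPLY, FORWARD, and COMBINE entries are used.

```python
CLOCK_NS = 100  # expected clock period of the prototype, in nanoseconds


def cycles(op, W=0, N=0):
    """Cycle count for a message type, per the corresponding Table 1 entries."""
    table = {
        "CALL": 5,
        "SEND": 8,
        "REPLY": 7,
        "FORWARD": 5 + N * W,
        "COMBINE": 5,
    }
    return table[op]


def latency_ns(op, W=0, N=0):
    """Convert the cycle count into nanoseconds at the assumed clock."""
    return cycles(op, W=W, N=N) * CLOCK_NS


send_ns = latency_ns("SEND")                  # 8 cycles
forward_ns = latency_ns("FORWARD", W=4, N=3)  # 5 + 3 * 4 = 17 cycles
```

Under these assumptions a SEND reaches its method in under a
microsecond, and forwarding a four-word message to three
destinations takes 1.7 microseconds.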


In the near future we plan to run benchmarks on a simulated
collection of MDPs to measure the hit ratios in the translation
buffer and method cache (as a function of cache size) and the
effectiveness of the row buffers.


6  Conclusion 

The message-driven processor (MDP) is able to process a set of
messages that support an object-oriented concurrent programming
system with an overhead of less than ten clock cycles per
message. This performance, more than an order of magnitude
improvement over existing message-passing systems, enables the
MDP to efficiently run programming systems that exploit
concurrency at a grain size of ≈ 10 instructions. In contrast, existing
machines operate efficiently only at a grain size of several
hundred instructions. We conjecture that by exploiting concurrency
at this fine grain size we will be able to achieve an order of
magnitude more concurrency for a given application than is possible
on existing machines.
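
The link between per-message overhead and usable grain size can be
made concrete with a simple model, sketched here in Python. The
model is an assumption introduced for illustration: each task does
`grain` units of useful work and pays `overhead` units of message
handling, both in the same unit. The specific values 10 and 300
are hypothetical stand-ins for "about ten" and "several hundred".

```python
def efficiency(grain, overhead):
    """Fraction of time spent on useful work per task."""
    return grain / (grain + overhead)


fine_grain_low_overhead = efficiency(10, 10)     # 0.5
fine_grain_high_overhead = efficiency(10, 300)   # about 0.03
coarse_grain_high_overhead = efficiency(300, 300)  # 0.5
```

In this model, a machine with an overhead of several hundred units
must use a grain of several hundred units to reach the efficiency
that a low-overhead node reaches at a grain of about ten.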

The MDP achieves much of its performance by using a
message-driven control mechanism. The MU handles reception and
buffering of arriving messages as well as directing the operation of the
IU. The IU simply executes instructions. It never makes a
decision concerning whether to buffer or execute an arriving message.
For each message, it is vectored to the proper entry point by the
MU. A single message type, EXECUTE, with two priority levels,
provides all the mechanism necessary to implement a concurrent
programming system.

The MDP uses a memory-based instruction set and two register
sets to implement fast context switches. The dual register sets
allow a high priority message to interrupt a lower priority message
without saving state. The memory-based instruction set allows
a context to save its state in five clock cycles. Operating out of
memory is almost as fast as operating out of registers since the
memory is implemented on chip and can be accessed in a single
clock cycle.

The MDP memory adds functionality in its peripheral circuitry
while preserving the density of a simple memory array. The
memory supports both indexed and associative access by placing
comparators in the column multiplexor. The associative access
mechanism speeds the execution of concurrent programs by allowing
address translation and method lookup to be performed in a
single clock cycle. This translation mechanism is made visible to the
programmer so it can be applied in other situations (e.g., method
lookup).

The MDP has been motivated by the development of high-performance
message-passing networks [5]. In early message-passing machines,
message latency, in the milliseconds, was the limiting factor. Now
that the message latency has been reduced to a few
microseconds, we can no longer ignore processor latencies in hundreds of
microseconds.

Some may argue that the MDP is unbalanced according to the rule
of thumb stating that a 1MIP processor should have a 1MByte
memory. The MDP is an ≈ 4MIP processor and has only a
16KByte memory (4KByte in the prototype). We argue, however,
that it is not the size of the memory in a single node that is
important, but rather the amount of memory that can be accessed
in a given period of time. In a 64K node machine constructed
from MDPs and using a fast routing network, a processor will be
able to access a uniform address space of 2^28 words (2^30 Bytes) in
less than 10µs.
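
The size of this aggregate address space follows directly from the
node count and per-node memory, as the short Python check below
shows. The figure of 4 bytes of data per word is an assumption
made here to relate the word count to the byte count.

```python
NODES = 64 * 1024           # 64K nodes = 2**16
BYTES_PER_NODE = 16 * 1024  # 16KByte per node = 2**14
BYTES_PER_WORD = 4          # assumption: 4 bytes of data per word

total_bytes = NODES * BYTES_PER_NODE
total_words = total_bytes // BYTES_PER_WORD
```

With these values the machine holds 2^30 bytes, or 2^28 words, in
total.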

The MDP provides many of the advantages of both
message-passing multicomputers and shared-memory multiprocessors. Like
a shared-memory machine, it provides a single global name space,
and needs to keep only a single copy of the application and
operating system code. Like a message-passing machine, the MDP
exploits locality in object placement, uses messages to trigger events,
and gains efficiency by sending a single message through the
network instead of sending multiple words. While we plan to
implement an object-oriented programming system on the MDP, we
also see the MDP as an emulator that can be used to experiment
with other programming models.

their suggestions on how to improve this paper.